Business Problem
With the onset of the pandemic in 2020, the way India eats has changed. The penetration of online food services in India is set to double by 2025, and restaurants will soon have to find ways to control operating expenses such as rent, electricity, and manpower. Cloud kitchens, or delivery-only outlets, are now a viable option for restaurants to explore. This comes with its own challenges: people now order food based on reviews and ratings, and various factors beyond quality and taste affect a restaurant's rating, such as the average cost of food, cuisines offered, restaurant type, and location. New restaurants often lack the data needed to forecast how they will perform, so they resort to trial and error to figure out what earns good reviews. This consumes time and resources that many small restaurants cannot afford, leaving them unable to compete with well-established restaurants and sometimes leading to failure. New restaurants also face many decisions, from choosing a location to selecting the cuisines to offer. This project addresses these issues by analysing Zomato data to predict a restaurant's rating from selected features and to understand the tastes of Bengalureans, thereby helping restaurants succeed by making the right choices.
Variables & Descriptions:

| Variable | Description |
|---|---|
| url | URL of the restaurant's page on the Zomato website |
| address | address of the restaurant in Bangalore |
| name | name of the restaurant |
| online_order | whether online ordering is available at the restaurant |
| book_table | whether a table-booking option is available at the restaurant |
| rate | overall rating of the restaurant out of 5 |
| votes | total number of ratings for the restaurant |
| phone | phone number of the restaurant |
| location | neighbourhood in which the restaurant is located |
| rest_type | restaurant type |
| dish_liked | dishes people liked at the restaurant |
| cuisines | food styles offered |
| approx_cost(for two people) | approximate cost of a meal for two people |
| reviews_list | list of tuples containing reviews for the restaurant; each tuple holds a rating and the review text |
| menu_item | list of menu items available at the restaurant |
| listed_in(type) | type of listing, e.g. Buffet, Delivery, Pubs and bars |
| listed_in(city) | neighbourhood in which the restaurant is listed |
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
from plotly.offline import iplot
from plotly import tools
import plotly.io as pio
pio.renderers.default = "notebook+pdf"
from warnings import filterwarnings
filterwarnings('ignore')
df = pd.read_csv(r'D:\Data_science\Projects\Imarticus\Capstone\Dataset\zomato.csv\zomato.csv')
#visualizing the first 5 rows of the dataset to get an overview of the data
df.head()
| url | address | name | online_order | book_table | rate | votes | phone | location | rest_type | dish_liked | cuisines | approx_cost(for two people) | reviews_list | menu_item | listed_in(type) | listed_in(city) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.zomato.com/bangalore/jalsa-banasha... | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 4.1/5 | 775 | 080 42297555\r\n+91 9743772233 | Banashankari | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Laja... | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari |
| 1 | https://www.zomato.com/bangalore/spice-elephan... | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 4.1/5 | 787 | 080 41714161 | Banashankari | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai G... | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari |
| 2 | https://www.zomato.com/SanchurroBangalore?cont... | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 3.8/5 | 918 | +91 9663487993 | Banashankari | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Choc... | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari |
| 3 | https://www.zomato.com/bangalore/addhuri-udupi... | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 3.7/5 | 88 | +91 9620009302 | Banashankari | Quick Bites | Masala Dosa | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari |
| 4 | https://www.zomato.com/bangalore/grand-village... | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 3.8/5 | 166 | +91 8026612447\r\n+91 9901210005 | Basavanagudi | Casual Dining | Panipuri, Gol Gappe | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari |
#visualizing the last 5 rows of the dataset to get an overview of the data
df.tail()
| url | address | name | online_order | book_table | rate | votes | phone | location | rest_type | dish_liked | cuisines | approx_cost(for two people) | reviews_list | menu_item | listed_in(type) | listed_in(city) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 51712 | https://www.zomato.com/bangalore/best-brews-fo... | Four Points by Sheraton Bengaluru, 43/3, White... | Best Brews - Four Points by Sheraton Bengaluru... | No | No | 3.6 /5 | 27 | 080 40301477 | Whitefield | Bar | NaN | Continental | 1,500 | [('Rated 5.0', "RATED\n Food and service are ... | [] | Pubs and bars | Whitefield |
| 51713 | https://www.zomato.com/bangalore/vinod-bar-and... | Number 10, Garudachar Palya, Mahadevapura, Whi... | Vinod Bar And Restaurant | No | No | NaN | 0 | +91 8197675843 | Whitefield | Bar | NaN | Finger Food | 600 | [] | [] | Pubs and bars | Whitefield |
| 51714 | https://www.zomato.com/bangalore/plunge-sherat... | Sheraton Grand Bengaluru Whitefield Hotel & Co... | Plunge - Sheraton Grand Bengaluru Whitefield H... | No | No | NaN | 0 | NaN | Whitefield | Bar | NaN | Finger Food | 2,000 | [] | [] | Pubs and bars | Whitefield |
| 51715 | https://www.zomato.com/bangalore/chime-sherato... | Sheraton Grand Bengaluru Whitefield Hotel & Co... | Chime - Sheraton Grand Bengaluru Whitefield Ho... | No | Yes | 4.3 /5 | 236 | 080 49652769 | ITPL Main Road, Whitefield | Bar | Cocktails, Pizza, Buttermilk | Finger Food | 2,500 | [('Rated 4.0', 'RATED\n Nice and friendly pla... | [] | Pubs and bars | Whitefield |
| 51716 | https://www.zomato.com/bangalore/the-nest-the-... | ITPL Main Road, KIADB Export Promotion Industr... | The Nest - The Den Bengaluru | No | No | 3.4 /5 | 13 | +91 8071117272 | ITPL Main Road, Whitefield | Bar, Casual Dining | NaN | Finger Food, North Indian, Continental | 1,500 | [('Rated 5.0', 'RATED\n Great ambience , look... | [] | Pubs and bars | Whitefield |
#understanding the shape of the dataset
print("The dimension of the dataset: ", df.shape)
print("Number of features: ", df.shape[1])
print("Number of rows: ", df.shape[0])
The dimension of the dataset:  (51717, 17)
Number of features:  17
Number of rows:  51717
#renaming the columns for better readability
df = df.rename(columns={'approx_cost(for two people)':'approx_cost','listed_in(type)':'type','listed_in(city)':'listed_city'})
# checking the names of all the features
df.columns
Index(['url', 'address', 'name', 'online_order', 'book_table', 'rate', 'votes',
'phone', 'location', 'rest_type', 'dish_liked', 'cuisines',
'approx_cost', 'reviews_list', 'menu_item', 'type', 'listed_city'],
dtype='object')
# Checking the datatype and presence of null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   url           51717 non-null  object
 1   address       51717 non-null  object
 2   name          51717 non-null  object
 3   online_order  51717 non-null  object
 4   book_table    51717 non-null  object
 5   rate          43942 non-null  object
 6   votes         51717 non-null  int64
 7   phone         50509 non-null  object
 8   location      51696 non-null  object
 9   rest_type     51490 non-null  object
 10  dish_liked    23639 non-null  object
 11  cuisines      51672 non-null  object
 12  approx_cost   51371 non-null  object
 13  reviews_list  51717 non-null  object
 14  menu_item     51717 non-null  object
 15  type          51717 non-null  object
 16  listed_city   51717 non-null  object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB
It can be seen from the above output that some features contain null values, which we need to treat.
# Checking count and percentage of null values of each feature
total = df.isnull().sum().sort_values(ascending=False)
percent = ((df.isnull().sum()/df.isnull().count()*100)).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(17)
| Total | Percent | |
|---|---|---|
| dish_liked | 28078 | 54.291626 |
| rate | 7775 | 15.033741 |
| phone | 1208 | 2.335789 |
| approx_cost | 346 | 0.669026 |
| rest_type | 227 | 0.438927 |
| cuisines | 45 | 0.087012 |
| location | 21 | 0.040606 |
| type | 0 | 0.000000 |
| menu_item | 0 | 0.000000 |
| reviews_list | 0 | 0.000000 |
| url | 0 | 0.000000 |
| address | 0 | 0.000000 |
| votes | 0 | 0.000000 |
| book_table | 0 | 0.000000 |
| online_order | 0 | 0.000000 |
| name | 0 | 0.000000 |
| listed_city | 0 | 0.000000 |
# visualising null values
sns.heatmap(df.isnull())
plt.show()
'dish_liked' has about 54% missing values; trying to impute that many values would introduce bias. Hence we drop the column.
#dropping the 'dish_liked' column
df.drop('dish_liked', axis = 1, inplace = True)
df.shape # we have successfully dropped the 'dish_liked' column
(51717, 16)
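The same rule can be generalised: rather than hard-coding the column, drop any column whose missing share crosses a threshold. A minimal sketch on a hypothetical mini-frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mimicking the dataset's missingness pattern
toy = pd.DataFrame({
    'dish_liked': [np.nan, 'Masala Dosa', np.nan, np.nan],  # 75% missing
    'rate': ['4.1/5', np.nan, '3.8/5', '3.7/5'],            # 25% missing
    'votes': [775, 787, 918, 88],                           # complete
})

# Drop every column whose share of missing values exceeds 50%
threshold = 0.5
missing_share = toy.isnull().mean()
toy = toy.drop(columns=missing_share[missing_share > threshold].index)
print(list(toy.columns))  # ['rate', 'votes']
```

With the real dataframe the same two lines would drop 'dish_liked' (54% missing) and keep everything else.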
'url' and 'phone' do not add any value to the rating prediction, hence we drop those columns as well.
df.drop(['url', 'phone'], axis = 1, inplace = True)
df.head()
| address | name | online_order | book_table | rate | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 4.1/5 | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari |
| 1 | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 4.1/5 | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari |
| 2 | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 3.8/5 | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari |
| 3 | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 3.7/5 | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari |
| 4 | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 3.8/5 | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari |
df.shape
(51717, 14)
# checking the available features after dropping the unnecessary features
df.columns
Index(['address', 'name', 'online_order', 'book_table', 'rate', 'votes',
'location', 'rest_type', 'cuisines', 'approx_cost', 'reviews_list',
'menu_item', 'type', 'listed_city'],
dtype='object')
The 'rate' column has about 15% missing values, and it is stored as a string, which is not ideal. Hence we will convert 'rate' into float format.
# checking unique values in rate column
df.rate.unique()
array(['4.1/5', '3.8/5', '3.7/5', '3.6/5', '4.6/5', '4.0/5', '4.2/5',
'3.9/5', '3.1/5', '3.0/5', '3.2/5', '3.3/5', '2.8/5', '4.4/5',
'4.3/5', 'NEW', '2.9/5', '3.5/5', nan, '2.6/5', '3.8 /5', '3.4/5',
'4.5/5', '2.5/5', '2.7/5', '4.7/5', '2.4/5', '2.2/5', '2.3/5',
'3.4 /5', '-', '3.6 /5', '4.8/5', '3.9 /5', '4.2 /5', '4.0 /5',
'4.1 /5', '3.7 /5', '3.1 /5', '2.9 /5', '3.3 /5', '2.8 /5',
'3.5 /5', '2.7 /5', '2.5 /5', '3.2 /5', '2.6 /5', '4.5 /5',
'4.3 /5', '4.4 /5', '4.9/5', '2.1/5', '2.0/5', '1.8/5', '4.6 /5',
'4.9 /5', '3.0 /5', '4.8 /5', '2.3 /5', '4.7 /5', '2.4 /5',
'2.1 /5', '2.2 /5', '2.0 /5', '1.8 /5'], dtype=object)
We see that there are values like 'NEW', '-', and NaN that cannot be converted to float. We cannot simply drop these rows, as they may be new restaurants without ratings, so we will write a loop that catches them as exceptions and converts them to NaN.
# the '/5' suffix needs to be removed, as it adds no value and hurts readability;
# we create a new column called 'rating' for the extracted values
df['rating'] = df['rate'].astype(str).apply(lambda x: x.split('/')[0])
# splitting the string at '/' and selecting only the number at 0 index
df.rating
0 4.1
1 4.1
2 3.8
3 3.7
4 3.8
...
51712 3.6
51713 nan
51714 nan
51715 4.3
51716 3.4
Name: rating, Length: 51717, dtype: object
# loop to convert string values to float: each pass catches one offending
# token, replaces it with 'nan', and retries until the cast succeeds
while True:
    try:
        df['rating'] = df['rating'].astype(float)
        break
    except ValueError as e:
        # extract the offending value (e.g. 'NEW', '-') from the error message
        error = str(e).split(":")[-1].strip().replace("'", "")
        print(f'Exception occurred while converting to float: {error}')
        df['rating'] = df['rating'].apply(lambda x: x.replace(error, str(np.nan)))
Exception occurred while converting to float: NEW
Exception occurred while converting to float: -
We have converted 'NEW' and '-' values into NaN values
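For reference, the same cleanup can be done without the try/except loop: `pd.to_numeric` with `errors='coerce'` turns anything non-numeric into NaN in one pass. A sketch on hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample of raw 'rate' strings as they appear in the dataset
rate = pd.Series(['4.1/5', '3.8 /5', 'NEW', '-', None])

# Take the part before '/', then coerce anything non-numeric ('NEW', '-') to NaN
rating = pd.to_numeric(rate.str.split('/').str[0], errors='coerce')
print(rating.tolist())  # [4.1, 3.8, nan, nan, nan]
```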
df.head()
| address | name | online_order | book_table | rate | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 4.1/5 | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari | 4.1 |
| 1 | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 4.1/5 | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari | 4.1 |
| 2 | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 3.8/5 | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari | 3.8 |
| 3 | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 3.7/5 | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari | 3.7 |
| 4 | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 3.8/5 | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari | 3.8 |
df.rating.dtypes # rating has been successfully converted to float
dtype('float64')
#dropping the old 'rate' column, which is no longer needed
df.drop('rate', axis = 1, inplace = True)
df.head()
| address | name | online_order | book_table | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari | 4.1 |
| 1 | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari | 4.1 |
| 2 | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari | 3.8 |
| 3 | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari | 3.7 |
| 4 | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari | 3.8 |
# the datatype of approx_cost is also object, which is not the right type for a numeric cost
df.approx_cost.dtypes
dtype('O')
#Changing the datatype of approx_cost column from object to float64
df['approx_cost'] = df['approx_cost'].astype(str) #to convert to string object to replace ','
df['approx_cost'] = df['approx_cost'].apply(lambda x: x.replace(',', '')) # removing ',' from the string
df['approx_cost'] = df['approx_cost'].astype(float) # converting to float64
df.approx_cost.dtypes
dtype('float64')
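The three steps above can likewise be collapsed into a single coercing pass; a sketch on hypothetical raw values:

```python
import pandas as pd

# Hypothetical raw 'approx_cost' values, including the comma thousands separator
cost = pd.Series(['800', '1,500', '2,000', None])

# Strip separators and coerce to float in one pass; unparseable values become NaN
cost_clean = pd.to_numeric(cost.str.replace(',', '', regex=False), errors='coerce')
print(cost_clean.tolist())  # [800.0, 1500.0, 2000.0, nan]
```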
viz2 = go.Bar(
x = df.location.value_counts().keys(),
y = df.location.value_counts(),
name = "location",
text = None)
data2 = [viz2]
layout = go.Layout(title = 'Restaurant Distribution by Location',
barmode = "group",
yaxis=dict(title= 'Number of Restaurants'))
fig = go.Figure(data = data2, layout = layout)
iplot(fig, filename = 'plot2')
We see that 'BTM' has the highest number of restaurants, followed by 'HSR' and 'Koramangala', all known for their proximity to IT hubs and the Outer Ring Road. We can say the IT industry has been a driving force of the restaurant industry in Bengaluru.
viz1 = go.Bar(
x = df['type'].value_counts().keys(),
y = df['type'].value_counts(),
name = "type",
text = None)
data1 = [viz1]
layout = go.Layout(title = 'Restaurant Distribution by Type of Service Offered',
barmode = "group",
yaxis=dict(title= 'Number of Restaurants'))
fig = go.Figure(data = data1, layout = layout)
iplot(fig, filename = 'plot_1')
We can see that the majority of restaurants support food delivery, which is an indication of the growing online-ordering trend.
viz3 = go.Bar(
x = df['rest_type'].value_counts().head(15).keys(),
y = df['rest_type'].value_counts().head(15),
name = "rest_type",
text = None)
data3 = [viz3]
layout = go.Layout(title = 'Restaurant Distribution by Sub-Categories',
barmode = "group",
yaxis=dict(title= 'Number of Restaurants'))
fig = go.Figure(data = data3, layout = layout)
iplot(fig, filename = 'plot3')
The 'Quick Bites' category has the majority share, which indicates a growing fast-food culture in the city.
plt.figure(figsize=(10,10))
x = df['online_order'].value_counts()
plt.pie(x,labels=['Accepted','Not Accepted'],autopct='%1.1f%%',textprops = {'fontsize': 16}, colors = ['#99ff99', '#ff8080'])
circle = plt.Circle( (0,0), 0.8, color='white')
p=plt.gcf()
p.gca().add_artist(circle)
plt.title("Online Order Service Offered", fontsize = 25)
plt.show()
We can see that almost 58% of restaurants accept online orders, while the remaining restaurants (roughly 42%) have not yet adopted online delivery. So there is still potential for online delivery companies to penetrate the market.
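The exact shares behind such a pie chart come straight from `value_counts(normalize=True)`; a sketch on hypothetical flags:

```python
import pandas as pd

# Hypothetical online_order flags; normalize=True yields exact percentage shares
orders = pd.Series(['Yes', 'Yes', 'No', 'Yes', 'No'])
shares = (orders.value_counts(normalize=True) * 100).round(1)
print(shares.to_dict())  # {'Yes': 60.0, 'No': 40.0}
```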
plt.figure(figsize=(10,10))
x = df['book_table'].value_counts()
# labels taken from the value_counts index so they cannot drift out of order
plt.pie(x, labels=x.index, autopct='%1.1f%%', textprops={'fontsize': 16}, colors=['darkslateblue', 'lightsalmon'])
circle = plt.Circle((0,0), 0.8, color='white')
p = plt.gcf()
p.gca().add_artist(circle)
plt.title("Table Booking Service Offered", fontsize=25)
plt.show()
Since value_counts places the majority class first, the chart shows that about 87.5% of restaurants do not offer table booking and only about 12.5% do. Comparing the two charts, most restaurants rely on online orders and walk-ins rather than reservations; some delivery-only outlets may represent cloud kitchens.
There is still a huge, largely untapped market opportunity in the cloud kitchen segment, and competition there is not yet intense.
restaurant_chains = df['name'].value_counts().to_frame()
restaurant_chains['Restaurant Name'] = restaurant_chains.index
restaurant_chains = restaurant_chains.reset_index(drop = True)
restaurant_chains.rename(columns={'name':'Count'}, inplace=True)
restaurant_chains.reset_index(drop = True)
restaurant_chains = restaurant_chains.loc[:, ['Restaurant Name', 'Count']]
restaurant_chains.head(10)
| Restaurant Name | Count | |
|---|---|---|
| 0 | Cafe Coffee Day | 96 |
| 1 | Onesta | 85 |
| 2 | Just Bake | 73 |
| 3 | Empire Restaurant | 71 |
| 4 | Five Star Chicken | 70 |
| 5 | Kanti Sweets | 68 |
| 6 | Petoo | 66 |
| 7 | Polar Bear | 65 |
| 8 | Baskin Robbins | 64 |
| 9 | Chef Baker's | 62 |
'Cafe Coffee Day' has the highest number of outlets, followed by 'Onesta' and 'Just Bake'; this shows Bengalureans' love for coffee, pizza, and desserts.
px.bar(restaurant_chains.head(10) , x='Restaurant Name', y='Count', title = "Top Restaurant Chains In Bengaluru")
quick_bite = df[df['rest_type'] == 'Quick Bites'].name.value_counts().head(10).to_frame()
quick_bite['Restaurant Name'] = quick_bite.index
quick_bite = quick_bite.reset_index(drop = True)
quick_bite.rename(columns={'name':'Count'}, inplace=True)
quick_bite.reset_index(drop = True)
quick_bite = quick_bite.loc[:, ['Restaurant Name', 'Count']]
quick_bite
| Restaurant Name | Count | |
|---|---|---|
| 0 | Five Star Chicken | 69 |
| 1 | Domino's Pizza | 60 |
| 2 | McDonald's | 59 |
| 3 | KFC | 56 |
| 4 | Ambur Hot Dum Biryani | 53 |
| 5 | Rolls On Wheels | 51 |
| 6 | Burger King | 51 |
| 7 | Pizza Stop | 48 |
| 8 | Goli Vada Pav No. 1 | 46 |
| 9 | Subway | 45 |
px.bar(quick_bite, x='Restaurant Name', y='Count', title = "Top Quick Bite Places")
Looks like 'Five Star Chicken' is leading the quick-bites category, followed by Domino's and McDonald's.
Multinational food-service giants dominate this category, with Domino's standing second in the segment.
casual_dining = df[df['rest_type'] == 'Casual Dining'].name.value_counts().head(10).to_frame()
casual_dining['Restaurant Name'] = casual_dining.index
casual_dining = casual_dining.reset_index(drop = True)
casual_dining.rename(columns={'name':'Count'}, inplace=True)
casual_dining.reset_index(drop = True)
casual_dining = casual_dining.loc[:, ['Restaurant Name', 'Count']]
casual_dining
| Restaurant Name | Count | |
|---|---|---|
| 0 | Empire Restaurant | 58 |
| 1 | Beijing Bites | 48 |
| 2 | Mani's Dum Biryani | 47 |
| 3 | Chung Wah | 46 |
| 4 | Barbeque Nation | 41 |
| 5 | Toscano | 41 |
| 6 | Oye Amritsar | 41 |
| 7 | A2B - Adyar Ananda Bhavan | 39 |
| 8 | New Prashanth Hotel | 38 |
| 9 | Pizza Hut | 38 |
px.bar(casual_dining, x='Restaurant Name', y='Count', title = "Top Casual Dining Places")
'Empire Restaurant', 'Beijing Bites', and 'Mani's Dum Biryani' are the top 3 restaurants in the casual dining category; this shows Bengalureans' love for biryani and Chinese cuisine.
popular_franchises = df.groupby(by='name', as_index=False).agg({'votes': 'sum','address': 'count','approx_cost': 'mean', 'rating': 'mean'})
popular_franchises.columns = ['name', 'total_votes', 'total_units', 'mean_approx_cost', 'mean_rating'] #renaming the columns appropriately
popular_franchises = popular_franchises.sort_values(by='mean_rating', ascending=False)
# organising the dataframe
popular_franchises = popular_franchises.loc[:, ['name', 'total_units', 'total_votes',
'mean_approx_cost', 'mean_rating']]
popular_franchises.head()
| name | total_units | total_votes | mean_approx_cost | mean_rating | |
|---|---|---|---|---|---|
| 597 | Asia Kitchen By Mainland China | 19 | 42273 | 1500.0 | 4.900000 |
| 1274 | Byg Brewski Brewing Company | 6 | 99531 | 1600.0 | 4.900000 |
| 6552 | SantÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃ... | 1 | 246 | 1000.0 | 4.900000 |
| 5927 | Punjab Grill | 7 | 9660 | 2000.0 | 4.871429 |
| 865 | Belgian Waffle Factory | 29 | 24882 | 400.0 | 4.844828 |
There seems to be an encoding error in one restaurant's name; let's try to correct it.
# locating the name with the index
popular_franchises.loc[6552]['name']
'SantÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82© Spa Cuisine'
The garbled bytes end in 'Ã©', which decodes to 'é'; the original name appears to be 'Santé Spa Cuisine'.
# lets replace with the correct name
a = 'SantÃ\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x83Ã\x83Â\x83Ã\x82Â\x83Ã\x83Â\x82Ã\x82Â\x82Ã\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82© Spa Cuisine'
popular_franchises['name'] = popular_franchises['name'].apply(lambda x: 'Santé Spa Cuisine' if x == a else x)
# top 10 highest rated restaurants
print("Top 10 Highest Rated Restaurant Chains")
popular_franchises.head(10)
Top 10 Highest Rated Restaurant Chains
| name | total_units | total_votes | mean_approx_cost | mean_rating | |
|---|---|---|---|---|---|
| 597 | Asia Kitchen By Mainland China | 19 | 42273 | 1500.000000 | 4.900000 |
| 1274 | Byg Brewski Brewing Company | 6 | 99531 | 1600.000000 | 4.900000 |
| 6552 | Santé Spa Cuisine | 1 | 246 | 1000.000000 | 4.900000 |
| 5927 | Punjab Grill | 7 | 9660 | 2000.000000 | 4.871429 |
| 865 | Belgian Waffle Factory | 29 | 24882 | 400.000000 | 4.844828 |
| 2598 | Flechazo | 6 | 29956 | 1400.000000 | 4.800000 |
| 5471 | O.G. Variar & Sons | 2 | 2317 | 200.000000 | 4.800000 |
| 8035 | The Pizza Bakery | 6 | 10523 | 1200.000000 | 4.800000 |
| 129 | AB's - Absolute Barbecues | 23 | 86418 | 1568.421053 | 4.789474 |
| 964 | Biergarten | 18 | 39843 | 2183.333333 | 4.766667 |
not_popular_franchises = popular_franchises.sort_values(by='mean_rating', ascending = True)
print("10 Lowest Rated Restaurant Chains")
not_popular_franchises.head(10)
10 Lowest Rated Restaurant Chains
| name | total_units | total_votes | mean_approx_cost | mean_rating | |
|---|---|---|---|---|---|
| 316 | Alibi - Maya International Hotel | 5 | 1123 | 1200.0 | 1.800000 |
| 2778 | Fusion Lounge | 9 | 3573 | 1500.0 | 2.000000 |
| 2112 | Decker's Lane | 4 | 970 | 400.0 | 2.100000 |
| 705 | Bageecha | 4 | 1916 | 650.0 | 2.150000 |
| 4636 | Mamma Mexicana | 13 | 5296 | 1000.0 | 2.200000 |
| 7584 | Taste Of Kerala | 5 | 240 | 600.0 | 2.240000 |
| 911 | Bhagini | 12 | 1683 | 800.0 | 2.283333 |
| 4788 | Meghana Biryani | 5 | 1185 | 800.0 | 2.300000 |
| 1044 | Biryani Junction | 9 | 2682 | 450.0 | 2.300000 |
| 8422 | Vande Matharam | 4 | 716 | 500.0 | 2.300000 |
cuisines = df['cuisines'].value_counts().head(20).to_frame()
cuisines['Cuisine names'] = cuisines.index
cuisines = cuisines.reset_index(drop = True)
cuisines.rename(columns={'cuisines':'Count'}, inplace=True)
cuisines = cuisines.loc[:, ['Cuisine names', 'Count']]
cuisines
| Cuisine names | Count | |
|---|---|---|
| 0 | North Indian | 2913 |
| 1 | North Indian, Chinese | 2385 |
| 2 | South Indian | 1828 |
| 3 | Biryani | 918 |
| 4 | Bakery, Desserts | 911 |
| 5 | Fast Food | 803 |
| 6 | Desserts | 766 |
| 7 | Cafe | 756 |
| 8 | South Indian, North Indian, Chinese | 726 |
| 9 | Bakery | 651 |
| 10 | Chinese | 556 |
| 11 | Ice Cream, Desserts | 417 |
| 12 | Chinese, North Indian | 415 |
| 13 | Mithai, Street Food | 372 |
| 14 | Desserts, Ice Cream | 354 |
| 15 | North Indian, Chinese, Biryani | 352 |
| 16 | North Indian, South Indian | 343 |
| 17 | South Indian, North Indian | 343 |
| 18 | North Indian, South Indian, Chinese | 305 |
| 19 | Beverages | 301 |
px.bar(cuisines, x="Cuisine names", y="Count", title = "Top Cuisines Bengalureans Prefer")
plt.figure(figsize=(10,7))
rating = df['rating'].value_counts()
sns.barplot(x= rating.index,y=rating)
plt.xlabel("Ratings", fontsize = 12)
plt.ylabel('Count',fontsize = 12)
plt.title("Distribution of Ratings", fontsize = 18)
plt.show()
From the visualization we can infer that 3.9 is the most common rating.
plt.figure(figsize=(10,8))
sns.histplot(df.approx_cost,bins = 50,kde = False)
plt.title("Distribution Plot of Approximate Cost", fontsize = 15)
plt.xlabel("Approximate Cost for 2 People",fontsize = 12)
plt.ylabel('Count',fontsize = 12)
plt.show()
From the above plot we can conclude that a meal for two costs approximately INR 500 at the majority of restaurants in Bengaluru.
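Rather than eyeballing the histogram peak, the typical bill can be computed directly; a sketch on hypothetical cleaned costs:

```python
import pandas as pd

# Hypothetical cleaned approx_cost values for a handful of restaurants
cost = pd.Series([300.0, 500.0, 500.0, 800.0, 1500.0])

# The median and mode give the "typical" two-person bill directly,
# and the median is robust to the long right tail seen in the histogram
print(cost.median())   # 500.0
print(cost.mode()[0])  # 500.0
```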
df_cheap = df[['name','approx_cost','location','rest_type','cuisines']].groupby(['approx_cost'], sort = True)
df_cheap = df_cheap.filter(lambda x: x.mean() <= 1500)
df_cheap = df_cheap.sort_values(by=['approx_cost'])
df_cheap
| name | approx_cost | location | rest_type | cuisines | |
|---|---|---|---|---|---|
| 18891 | Srinidhi Sagar Food Line | 40.0 | Indiranagar | Quick Bites | South Indian, North Indian, Chinese |
| 32485 | Srinidhi Sagar | 40.0 | Old Airport Road | Quick Bites | South Indian, North Indian, Chinese |
| 5270 | Srinidhi Sagar Food Line | 40.0 | Indiranagar | Quick Bites | South Indian, North Indian, Chinese |
| 14819 | Srinidhi Sagar Food Line | 40.0 | Indiranagar | Quick Bites | South Indian, North Indian, Chinese |
| 27091 | Srinidhi Sagar Deluxe | 40.0 | Domlur | Quick Bites | South Indian, North Indian, Chinese |
| ... | ... | ... | ... | ... | ... |
| 5516 | Feast India Co. | 1500.0 | Cunningham Road | Casual Dining | Awadhi, Bengali, North Indian |
| 19279 | Barebones - The Balcony Bar | 1500.0 | Indiranagar | Bar, Casual Dining | Chinese, Continental, Finger Food, Italian |
| 19278 | Lono | 1500.0 | Indiranagar | Lounge, Casual Dining | Continental, Modern Indian |
| 6103 | Chianti | 1500.0 | MG Road | Casual Dining | Italian |
| 51716 | The Nest - The Den Bengaluru | 1500.0 | ITPL Main Road, Whitefield | Bar, Casual Dining | Finger Food, North Indian, Continental |
49592 rows × 5 columns
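Since `groupby().filter` does the heavy lifting here, a small sketch on a toy frame (values invented) shows its behaviour: it keeps or drops whole groups based on a predicate and returns the original rows, not aggregates:

```python
import pandas as pd

# Toy frame illustrating groupby().filter
toy = pd.DataFrame({
    'approx_cost': [40.0, 40.0, 1500.0, 6000.0],
    'name': ['A', 'B', 'C', 'D'],
})

# Groups are keyed by approx_cost; a group survives only if its mean cost
# is within budget (here the mean trivially equals the group key)
budget = toy.groupby('approx_cost').filter(lambda g: g['approx_cost'].mean() <= 1500)
print(budget['name'].tolist())  # ['A', 'B', 'C']
```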
df_expen = df[['name','approx_cost','location','rest_type','cuisines']].groupby(['approx_cost'])
df_expen = df_expen.filter(lambda x: x.mean() >= 3000)
df_expen = df_expen.sort_values(by=['approx_cost'])
df_expen
| name | approx_cost | location | rest_type | cuisines | |
|---|---|---|---|---|---|
| 4983 | Rim Naam - The Oberoi | 3000.0 | MG Road | Fine Dining | Thai |
| 39120 | Ssaffron - Shangri-La Hotel | 3000.0 | Vasanth Nagar | Fine Dining, Bar | North Indian |
| 39127 | Hype - Shangri-La Hotel | 3000.0 | Vasanth Nagar | Lounge, Bar | Finger Food |
| 39138 | Mikusu - Conrad Bengaluru | 3000.0 | Ulsoor | Fine Dining | Japanese, Chinese, Thai |
| 39150 | Indian Durbar - Conrad Bengaluru | 3000.0 | Ulsoor | Fine Dining | North Indian |
| ... | ... | ... | ... | ... | ... |
| 42141 | Malties - Radisson Blu | 4500.0 | Marathahalli | Lounge | Continental, Fast Food |
| 41591 | Malties - Radisson Blu | 4500.0 | Marathahalli | Lounge | Continental, Fast Food |
| 40266 | Royal Afghan - ITC Windsor | 5000.0 | Sankey Road | Fine Dining | North Indian, Mughlai |
| 45618 | Le Cirque Signature - The Leela Palace | 6000.0 | Old Airport Road | Fine Dining | French, Italian |
| 19139 | Le Cirque Signature - The Leela Palace | 6000.0 | Old Airport Road | Fine Dining | French, Italian |
241 rows × 5 columns
df_rating = df[['name','rating']].groupby(['rating'])
df_rating = df_rating.filter(lambda x: x.mean() >= 4.5) # groupby above returns a GroupBy object, hence filtering with a lambda
df_rating = df_rating.sort_values(by=['rating'],ascending = False)
df_rating.head(10)
| name | rating | |
|---|---|---|
| 43055 | Belgian Waffle Factory | 4.9 |
| 49170 | Byg Brewski Brewing Company | 4.9 |
| 50059 | Byg Brewski Brewing Company | 4.9 |
| 3921 | Byg Brewski Brewing Company | 4.9 |
| 42381 | Belgian Waffle Factory | 4.9 |
| 36684 | Asia Kitchen By Mainland China | 4.9 |
| 11504 | Asia Kitchen By Mainland China | 4.9 |
| 4801 | Byg Brewski Brewing Company | 4.9 |
| 9099 | Asia Kitchen By Mainland China | 4.9 |
| 49627 | Byg Brewski Brewing Company | 4.9 |
cheap_and_best = pd.merge(df_cheap, df_rating, how='inner', on=['name'])
expensive_and_best = pd.merge(df_expen, df_rating, how='inner', on=['name'])
print("Top 10 High Rated Restaurants When under Budget")
cheap_and_best.head(10)
Top 10 High Rated Restaurants When under Budget
| name | approx_cost | location | rest_type | cuisines | rating | |
|---|---|---|---|---|---|---|
| 0 | Brahmin's Coffee Bar | 100.0 | Basavanagudi | Quick Bites | South Indian | 4.8 |
| 1 | Brahmin's Coffee Bar | 250.0 | Malleshwaram | Quick Bites | South Indian | 4.8 |
| 2 | Taaza Thindi | 100.0 | Banashankari | Quick Bites | South Indian | 4.7 |
| 3 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.6 |
| 4 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.6 |
| 5 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.5 |
| 6 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.5 |
| 7 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.5 |
| 8 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.5 |
| 9 | Natural Ice Cream | 150.0 | Jayanagar | Dessert Parlor | Ice Cream, Desserts | 4.5 |
print("Top 10 High Rated Restaurants For Fine Dining")
expensive_and_best.value_counts()
Top 10 High Rated Restaurants For Fine Dining
| name | approx_cost | location | rest_type | cuisines | rating | count |
|---|---|---|---|---|---|---|
| Rim Naam - The Oberoi | 3000.0 | MG Road | Fine Dining | Thai | 4.6 | 144 |
| Karavalli - The Gateway Hotel | 3500.0 | Residency Road | Fine Dining | Mangalorean, Konkan, Seafood, Kerala | 4.5 | 25 |
| Alba - JW Marriott Bengaluru | 4000.0 | Lavelle Road | Fine Dining | Italian | 4.5 | 4 |
dtype: int64
top_cheap_locations = cheap_and_best["location"].value_counts().to_frame()
top_cheap_locations = top_cheap_locations.reset_index(level=0)
top_cheap_locations.rename(columns={'index':'location', 'location': 'Count'}, inplace=True)
top_cheap_locations.head(5)
| location | Count | |
|---|---|---|
| 0 | Koramangala 5th Block | 5142 |
| 1 | Cunningham Road | 1934 |
| 2 | Indiranagar | 1290 |
| 3 | Koramangala 7th Block | 1161 |
| 4 | BTM | 916 |
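The `reset_index`/`rename` dance above depends on older pandas naming the counts column after the original column; naming the index and the counts column explicitly sidesteps the version difference. A sketch on toy data:

```python
import pandas as pd

toy = pd.Series(["BTM", "HSR", "BTM", "BTM", "HSR", "Indiranagar"], name="location")

# Works the same on older and newer pandas: name the index and counts column explicitly.
counts = toy.value_counts().rename_axis("location").reset_index(name="Count")
```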
import plotly.express as px
px.bar(top_cheap_locations.head(5), x = "location" , y = "Count", title = "Top Locations for Low Price, High Rated Restaurants")
top_expens_locations = expensive_and_best["location"].value_counts().to_frame()
top_expens_locations = top_expens_locations.reset_index(level=0)
top_expens_locations.rename(columns={'index':'location', 'location': 'Count'}, inplace=True)
top_expens_locations.head(10)
| location | Count | |
|---|---|---|
| 0 | MG Road | 144 |
| 1 | Residency Road | 25 |
| 2 | Lavelle Road | 4 |
px.bar(top_expens_locations.head(10), x = 'location' , y = "Count", title = "Top Locations for Expensive, High Rated Restaurants")
location_cost = df.groupby(by='location', as_index=False).agg({'approx_cost': 'mean'})
location_cost.columns = ['location', 'mean_approx_cost']
location_cost = location_cost.sort_values(by='mean_approx_cost', ascending=False)
location_cost.head(10)
| location | mean_approx_cost | |
|---|---|---|
| 75 | Sankey Road | 2505.555556 |
| 66 | Race Course Road | 1309.352518 |
| 51 | Lavelle Road | 1307.934990 |
| 52 | MG Road | 1155.704698 |
| 28 | Infantry Road | 1062.251656 |
| 70 | Residency Road | 966.320475 |
| 50 | Langford Town | 883.333333 |
| 81 | St. Marks Road | 871.306818 |
| 15 | Cunningham Road | 864.969450 |
| 12 | Church Street | 834.885764 |
px.bar(location_cost, x = 'location' , y = "mean_approx_cost", title = "Approximate Cost By Location", labels = {'location' : 'Location', "mean_approx_cost" : 'Approximate cost for 2 people'} )
from geopy.geocoders import Nominatim
import folium
from folium.plugins import HeatMap
from folium.plugins import FastMarkerCluster
len(df['location'].unique())
94
locations = pd.DataFrame({"Name" : df.location.unique()})
geolocation = Nominatim(user_agent= "app")
locations
| Name | |
|---|---|
| 0 | Banashankari |
| 1 | Basavanagudi |
| 2 | Mysore Road |
| 3 | Jayanagar |
| 4 | Kumaraswamy Layout |
| ... | ... |
| 89 | West Bangalore |
| 90 | Magadi Road |
| 91 | Yelahanka |
| 92 | Sahakara Nagar |
| 93 | Peenya |
94 rows × 1 columns
lat = []
lon = []
for name in locations['Name']:
    place = geolocation.geocode(name)  # bare neighbourhood names can resolve to places outside Bengaluru
    # Nominatim's usage policy allows at most one request per second; consider time.sleep(1) here
    if place is None:
        lat.append(np.nan)
        lon.append(np.nan)
    else:
        lat.append(place.latitude)
        lon.append(place.longitude)
locations["latitude"] = lat
locations["longitude"] = lon
locations.head()
| Name | latitude | longitude | |
|---|---|---|---|
| 0 | Banashankari | 15.887678 | 75.704678 |
| 1 | Basavanagudi | 12.941726 | 77.575502 |
| 2 | Mysore Road | 12.387214 | 76.666963 |
| 3 | Jayanagar | 27.643927 | 83.052805 |
| 4 | Kumaraswamy Layout | 12.908149 | 77.555318 |
Restaurant_locations = pd.DataFrame(df['location'].value_counts().reset_index())
Restaurant_locations.columns = ['Name','count']
Restaurant_locations.head()
| Name | count | |
|---|---|---|
| 0 | BTM | 5124 |
| 1 | HSR | 2523 |
| 2 | Koramangala 5th Block | 2504 |
| 3 | JP Nagar | 2235 |
| 4 | Whitefield | 2144 |
print(locations.shape)
print(Restaurant_locations.shape)
(94, 3) (93, 2)
def generateBaseMap(default_location=[12.97, 77.59], default_zoom_start=12):
base_map = folium.Map(location=default_location, zoom_start=default_zoom_start)
return base_map
basemap = generateBaseMap()
print("Base Map of Bengaluru")
basemap
Base Map of Bengaluru
Restaurant_locations = Restaurant_locations.merge(locations,on='Name',how="left").dropna()
Restaurant_locations.head()
| Name | count | latitude | longitude | |
|---|---|---|---|---|
| 0 | BTM | 5124 | 45.954851 | -112.496595 |
| 1 | HSR | 2523 | 18.147500 | 41.538889 |
| 2 | Koramangala 5th Block | 2504 | 12.934377 | 77.628415 |
| 3 | JP Nagar | 2235 | 12.265594 | 76.646540 |
| 4 | Whitefield | 2144 | 44.373058 | -71.611858 |
HeatMap(Restaurant_locations[['latitude','longitude','count']], radius=20).add_to(basemap)
print("Heatmap of Restaurants Distribution in Bengaluru")
basemap
Heatmap of Restaurants Distribution in Bengaluru
We can see that most restaurants are concentrated in the south-eastern part of Bengaluru. There are also quite a few areas around central Bengaluru where restaurant density is not that high; these might be potential locations for new restaurants looking to capture the market without facing much competition.
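Some of the geocoded points above fall far outside Bengaluru (e.g. BTM resolving to coordinates in North America), because bare neighbourhood names are ambiguous to Nominatim. Before plotting, results can be sanity-checked against an approximate Bengaluru bounding box; the coordinate limits below are rough values chosen for illustration:

```python
import pandas as pd

# Approximate bounding box for Bengaluru (rough values for illustration).
LAT_MIN, LAT_MAX = 12.75, 13.20
LON_MIN, LON_MAX = 77.35, 77.85

def in_bengaluru(lat, lon):
    """Return True if (lat, lon) falls inside the approximate city box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# Toy frame mimicking a few geocoded rows from above.
geo = pd.DataFrame({
    "Name": ["BTM", "Koramangala 5th Block", "Whitefield"],
    "latitude": [45.954851, 12.934377, 44.373058],
    "longitude": [-112.496595, 77.628415, -71.611858],
})
valid = geo[[in_bengaluru(a, b) for a, b in zip(geo["latitude"], geo["longitude"])]]
```

Appending ", Bengaluru, India" to the geocode query also reduces such misses in the first place.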
from textblob import TextBlob
sentiment = []
for i in range(0,len(df)):
x = TextBlob(df.loc[i,"reviews_list"])
if x.sentiment.polarity > 0:
sentiment.append("positive")
elif x.sentiment.polarity == 0:
sentiment.append("neutral")
else:
sentiment.append("negative")
df_sent = df.copy()
df_sent["reviews_list"] = sentiment
df_sent.head()
| address | name | online_order | book_table | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 775 | Banashankari | Casual Dining | North Indian, Mughlai, Chinese | 800.0 | positive | [] | Buffet | Banashankari | 4.1 |
| 1 | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 787 | Banashankari | Casual Dining | Chinese, North Indian, Thai | 800.0 | positive | [] | Buffet | Banashankari | 4.1 |
| 2 | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 918 | Banashankari | Cafe, Casual Dining | Cafe, Mexican, Italian | 800.0 | positive | [] | Buffet | Banashankari | 3.8 |
| 3 | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 88 | Banashankari | Quick Bites | South Indian, North Indian | 300.0 | positive | [] | Buffet | Banashankari | 3.7 |
| 4 | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 166 | Basavanagudi | Casual Dining | North Indian, Rajasthani | 600.0 | positive | [] | Buffet | Banashankari | 3.8 |
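The row-by-row polarity loop above can be vectorized once polarity scores are available; a sketch with `np.select`, using hard-coded scores in place of TextBlob output:

```python
import numpy as np
import pandas as pd

# Hard-coded polarity scores standing in for TextBlob's sentiment.polarity.
polarity = pd.Series([0.6, 0.0, -0.3, 0.1])

labels = np.select(
    [polarity > 0, polarity == 0],  # conditions checked in order
    ["positive", "neutral"],
    default="negative",
)
```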
positive = df_sent[df_sent["reviews_list"] == "positive"]
negative = df_sent[df_sent["reviews_list"] == "negative"]
neutral = df_sent[df_sent["reviews_list"] == "neutral"]
print("Restaurants having positive reviews: ",len(positive))
Restaurants having positive reviews: 39629
print("Restaurants having negative reviews: ",len(negative))
Restaurants having negative reviews: 4324
print("Restaurants having neutral reviews: ",len(neutral))
Restaurants having neutral reviews: 7764
df_sentiment = df_sent["reviews_list"].value_counts().to_frame()
df_sentiment = df_sentiment.reset_index()
df_sentiment = df_sentiment.rename(columns = {'index':'sentiment', 'reviews_list': 'Count'})
df_sentiment
| sentiment | Count | |
|---|---|---|
| 0 | positive | 39629 |
| 1 | neutral | 7764 |
| 2 | negative | 4324 |
px.bar(df_sentiment, x = 'sentiment' , y = "Count", title = "Review Sentiment Distribution", color = 'sentiment',
       color_discrete_map={'positive': 'green', 'neutral': 'lightblue', 'negative': 'red'})
The data reveals that most restaurants have positive review sentiment, which is good news for the restaurant industry.
df_model = df_sent.copy()
df_model['rating'] = df_model['rating'].replace(np.nan, 0)  # treat unrated restaurants as rating 0
df_model.dropna(axis=0, inplace = True)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# label encoding to convert categorical variables into machine-readable form for certain algorithms
x = ['online_order','book_table','rest_type','cuisines','listed_city', 'location', 'type', 'reviews_list']
for i in x:
df_model[i] = le.fit_transform(df_model[i])
df_model.head()
| address | name | online_order | book_table | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | 1 | 1 | 775 | 1 | 27 | 2145 | 800.0 | 2 | [] | 0 | 1 | 4.1 |
| 1 | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | 1 | 0 | 787 | 1 | 27 | 947 | 800.0 | 2 | [] | 0 | 1 | 4.1 |
| 2 | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | 1 | 0 | 918 | 1 | 22 | 761 | 800.0 | 2 | [] | 0 | 1 | 3.8 |
| 3 | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | 0 | 0 | 88 | 1 | 78 | 2539 | 300.0 | 2 | [] | 0 | 1 | 3.7 |
| 4 | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | 0 | 0 | 166 | 4 | 27 | 2174 | 600.0 | 2 | [] | 0 | 1 | 3.8 |
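One caveat: label encoding imposes an arbitrary numeric order on nominal categories such as cuisines, which tree models tolerate but linear models can misread. One-hot encoding is a common alternative; a minimal sketch with `pd.get_dummies`:

```python
import pandas as pd

toy = pd.DataFrame({"rest_type": ["Quick Bites", "Fine Dining", "Quick Bites"]})

# One column per category, 0/1 valued, no implied ordering.
onehot = pd.get_dummies(toy, columns=["rest_type"])
```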
df_model.drop(columns = ['address','menu_item', 'name'], axis = 1, inplace = True)
df_model.shape
(51148, 11)
features1 = df_model.drop(['rating'], axis = 1)
target = df_model['rating']
# Scaling the features to bring various features under a similar scale and improve model performance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features = scaler.fit_transform(features1)
features
array([[ 0.82912014, 2.63270813, 0.60767491, ..., 0.51104057,
-2.46134804, -1.59071902],
[ 0.82912014, -0.37983702, 0.62255425, ..., 0.51104057,
-2.46134804, -1.59071902],
[ 0.82912014, -0.37983702, 0.78498706, ..., 0.51104057,
-2.46134804, -1.59071902],
...,
[-1.20609783, -0.37983702, -0.35328253, ..., -1.1055531 ,
2.79876977, 1.78898279],
[-1.20609783, 2.63270813, -0.06065549, ..., 0.51104057,
2.79876977, 1.78898279],
[-1.20609783, -0.37983702, -0.33716325, ..., 0.51104057,
2.79876977, 1.78898279]])
Let's split the data into training and test sets, using 80% of the data for training and 20% as the test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features , target, test_size=0.20, random_state=123)
# Check the shapes of the split data
x_train.shape,x_test.shape,y_train.shape,y_test.shape
((40918, 10), (10230, 10), (40918,), (10230,))
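Note that above, the scaler was fit on the full feature matrix before splitting, so test-set statistics leak into training. A leakage-free variant fits the scaler on the training split only; a sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics from training data only
X_te_scaled = scaler.transform(X_te)      # reuse those statistics on the test set
```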
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
y_predict_lr = lr.predict(x_test)
from sklearn.metrics import r2_score
r2_score(y_test,y_predict_lr)
0.3415262784393306
The R² score for the linear regression model is very low. Regularised variants such as Lasso and Ridge are unlikely to improve performance drastically, so let's evaluate other models before deciding on hyperparameter tuning.
from sklearn.tree import DecisionTreeRegressor
dtree = DecisionTreeRegressor()
dtree.fit(x_train,y_train)
y_predict_dt = dtree.predict(x_test)
r2_score(y_test,y_predict_dt)
0.9808213833054206
The tree-based model performs well in terms of R² score. Let's use ensemble learning to boost the performance further.
# Preparing the random forest regressor
from sklearn.ensemble import RandomForestRegressor
r_forest = RandomForestRegressor(n_estimators=200, random_state = 123)
r_forest.fit(x_train,y_train)
y_predict_rf = r_forest.predict(x_test)
r2_score(y_test,y_predict_rf)
0.9896641961151734
The random forest regressor performs better than the decision tree regressor. Let's check out other models as well.
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor()
gbr.fit(x_train, y_train)
y_predict_gbr = gbr.predict(x_test)
r2_score(y_test,y_predict_gbr)
0.9623416944579003
The gradient boosting model's performance is slightly lower than the random forest regressor's. Let's check out eXtreme Gradient Boosting (XGBoost).
from xgboost import XGBRegressor
xgbr = XGBRegressor(booster = 'gbtree', learning_rate = 0.1, max_depth = 7, n_estimators = 200)
xgbr.fit(x_train, y_train)
y_predict_xgbr = xgbr.predict(x_test)
r2_score(y_test,y_predict_xgbr)
0.9768642352075912
Although the XGBoost regressor performed better than gradient boosting, its score is still lower than the random forest regressor's.
models = pd.DataFrame({
'Model' : ['Linear Regression', 'Decision Tree', 'Random Forest', 'Gradient Boost', 'XgBoost'],
'R2_Score' : [lr.score(x_test, y_test), dtree.score(x_test, y_test), r_forest.score(x_test, y_test), gbr.score(x_test, y_test), xgbr.score(x_test, y_test)]
})
models.sort_values(by = 'R2_Score', ascending = False)
| Model | R2_Score | |
|---|---|---|
| 2 | Random Forest | 0.989664 |
| 1 | Decision Tree | 0.980821 |
| 4 | XgBoost | 0.976864 |
| 3 | Gradient Boost | 0.962342 |
| 0 | Linear Regression | 0.341526 |
As we can see, the random forest regressor is the best-performing model. We will save this model for making further predictions.
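Since the random forest is the chosen model, its `feature_importances_` attribute can show which inputs drive the predicted rating. A self-contained sketch on synthetic data (the column names below are illustrative, not the project's actual features):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "votes": rng.integers(0, 1000, 200),
    "approx_cost": rng.integers(100, 3000, 200),
    "noise": rng.normal(size=200),
})
# Synthetic target driven mostly by "votes"
y = 0.002 * X["votes"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)
```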
prediction_df = pd.DataFrame({"Actual Rating": y_test, "Predicted Rating": np.round(y_predict_rf, 1)})
prediction_df.sample(10, random_state=43)
| Actual Rating | Predicted Rating | |
|---|---|---|
| 37709 | 3.3 | 3.3 |
| 10420 | 4.1 | 4.0 |
| 7434 | 3.2 | 3.7 |
| 46091 | 0.0 | 0.0 |
| 40185 | 4.0 | 3.8 |
| 30663 | 0.0 | 0.0 |
| 26577 | 4.0 | 4.0 |
| 41542 | 4.6 | 4.6 |
| 46491 | 3.2 | 3.2 |
| 18955 | 3.3 | 3.3 |
import pickle
filename = 'final_model.sav'
pickle.dump(r_forest, open(filename, 'wb'))
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(x_test, y_test)
print(result)
0.9896641961151734
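For scikit-learn estimators, `joblib` (which ships alongside scikit-learn) is often preferred over plain `pickle` for large numpy-backed models; either way the saved file is version-sensitive, so it should be loaded with the same scikit-learn version. A sketch with a small model:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1
model = LinearRegression().fit(X, y)

path = os.path.join(tempfile.gettempdir(), "zomato_model.joblib")
joblib.dump(model, path)        # serialize to disk
restored = joblib.load(path)    # round-trip
```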
# sampling half the data to keep the pairwise similarity matrix manageable
df_recommender = df.sample(frac=0.5)
df_recommender.shape
(25858, 14)
# dropping unwanted columns
df_recommender = df_recommender.drop(['address','rest_type', 'type', 'menu_item', 'votes'],axis=1)
df_recommender.head()
| name | online_order | book_table | location | cuisines | approx_cost | reviews_list | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|
| 49709 | Punjabi Last Stand | No | No | Sarjapur Road | North Indian | 300.0 | [] | Sarjapur Road | NaN |
| 2780 | Thyme & Whisk | Yes | No | Jayanagar | Asian, Chinese, Continental, Italian | 800.0 | [('Rated 2.0', 'RATED\n For starters, we orde... | Basavanagudi | NaN |
| 13503 | Sri Krishna Sagar | Yes | No | Electronic City | South Indian, North Indian | 300.0 | [('Rated 5.0', 'RATED\n Tasty food and good a... | Electronic City | 3.5 |
| 26424 | Ambur Khousia Biryani Corner | No | No | Banaswadi | Biryani | 100.0 | [] | Kammanahalli | NaN |
| 6242 | Strikers Sports Bar and Music | No | Yes | Shanti Nagar | Continental, North Indian, Chinese | 1200.0 | [('Rated 1.0', "RATED\n Dikhave Pe Mat Jao, A... | Brigade Road | 3.8 |
df_recommender.set_index('name', inplace=True)
indices = pd.Series(df_recommender.index)
indices
0 Punjabi Last Stand
1 Thyme & Whisk
2 Sri Krishna Sagar
3 Ambur Khousia Biryani Corner
4 Strikers Sports Bar and Music
...
25853 Calcutta Victoria Chat House
25854 Cake N Cookies
25855 Burger King
25856 Momo Time
25857 Miss Momo
Name: name, Length: 25858, dtype: object
import re
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_recommender['reviews_list'])
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
def recommend(name, cosine_similarities=cosine_similarities):
    recommend_restaurant = []  # list to hold the top similar restaurants
    idx = indices[indices == name].index[0]  # find the index of the restaurant entered
    # Rank all restaurants by cosine similarity to the one entered, highest first
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    # Extract the 31 highest-scoring indexes (position 0 is the restaurant itself)
    top30_indexes = list(score_series.iloc[0:31].index)
    # Names of the top similar restaurants
    for each in top30_indexes:
        recommend_restaurant.append(list(df_recommender.index)[each])
    # Collect the candidates' details (pd.concat replaces the removed DataFrame.append)
    frames = []
    for each in recommend_restaurant:
        frames.append(df_recommender[['cuisines', 'rating', 'approx_cost', 'location']][df_recommender.index == each].sample())
    df_new = pd.concat(frames)
    # Drop duplicate entries and keep only the top 10 by rating
    df_new = df_new.drop_duplicates(subset=['cuisines', 'rating', 'approx_cost', 'location'], keep=False)
    df_new = df_new.sort_values(by='rating', ascending=False).head(10)
    print('TOP %s RESTAURANTS LIKE %s WITH SIMILAR REVIEWS: ' % (str(len(df_new)), name))
    return df_new
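The recommender's core, TF-IDF plus `linear_kernel` (a dot product of L2-normalised TF-IDF rows, i.e. cosine similarity), can be seen on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

reviews = [
    "great barbecue and buffet, friendly staff",
    "barbecue buffet was great, loved the staff",
    "terrible coffee, long wait",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(reviews)
sim = linear_kernel(matrix, matrix)  # cosine similarity, since rows are L2-normalised
```

The first two reviews share vocabulary and score high against each other, while the third shares none and scores zero.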
def search_restaurant(restaurant_name):
x = df['name'].str.contains(restaurant_name, case = False, na = False)
return df[x].head()
search_restaurant('Absolute')
| address | name | online_order | book_table | votes | location | rest_type | cuisines | approx_cost | reviews_list | menu_item | type | listed_city | rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3765 | 1st Floor, GRS Towers, Above Spencers Hyper Ma... | AB's - Absolute Barbecues | No | Yes | 881 | Sarjapur Road | Casual Dining | European, Mediterranean, North Indian, BBQ | 1600.0 | [('Rated 2.0', "RATED\n Looks AB is finding h... | [] | Buffet | Bellandur | 4.7 |
| 4141 | Behind Prestige Tech Park, Kadubisanahalli, Ma... | Absolute Shawarma | Yes | No | 116 | Marathahalli | Takeaway, Delivery | Lebanese | 150.0 | [('Rated 4.0', 'RATED\n Good food at a budget... | [] | Delivery | Bellandur | 3.8 |
| 4809 | 1st Floor, GRS Towers, Above Spencers Hyper Ma... | AB's - Absolute Barbecues | No | Yes | 881 | Sarjapur Road | Casual Dining | European, Mediterranean, North Indian, BBQ | 1600.0 | [('Rated 2.0', "RATED\n Looks AB is finding h... | [] | Dine-out | Bellandur | 4.7 |
| 6745 | 2nd Floor, I20-A2, EPIP Zone, Near Vydehi Hosp... | AB's - Absolute Barbecues | No | Yes | 2882 | Whitefield | Casual Dining | European, Mediterranean, North Indian, BBQ | 1600.0 | [('Rated 4.0', 'RATED\n Went today for Lunch ... | [] | Buffet | Brookefield | 4.8 |
| 7601 | 750/1, Near Salem Kitchen, ITPL Main Road, Kun... | Absolute Shawarma | Yes | No | 0 | Brookefield | Takeaway, Delivery | Lebanese | 150.0 | [('Rated 5.0', 'RATED\n The shawarma was real... | ['Arabian Combo', 'Mexican Combo', 'Arabian Sh... | Delivery | Brookefield | NaN |
recommend("AB's - Absolute Barbecues")
TOP 10 RESTAURANTS LIKE AB's - Absolute Barbecues WITH SIMILAR REVIEWS:
| cuisines | rating | approx_cost | location | |
|---|---|---|---|---|
| Flechazo | Asian, Mediterranean, North Indian, BBQ | 4.9 | 1400.0 | Whitefield |
| Byg Brewski Brewing Company | Continental, North Indian, Italian, South Indi... | 4.9 | 1600.0 | Sarjapur Road |
| AB's - Absolute Barbecues | European, Mediterranean, North Indian, BBQ | 4.8 | 1600.0 | Whitefield |
| The Black Pearl | North Indian, European, Mediterranean, BBQ | 4.8 | 1500.0 | Marathahalli |
| AB's - Absolute Barbecues | European, Mediterranean, North Indian, BBQ | 4.7 | 1600.0 | Sarjapur Road |
| The Black Pearl | North Indian, European, Mediterranean | 4.7 | 1400.0 | Koramangala 5th Block |
| Hammered | North Indian, Thai, Japanese, Continental, Cafe | 4.7 | 1300.0 | Cunningham Road |
| Buff Buffet Buff | North Indian, Chinese, Continental, Thai, Salad | 4.5 | 1500.0 | Koramangala 5th Block |
| Deja Vu Resto Bar | North Indian, Italian | 4.4 | 900.0 | Bannerghatta Road |
| Barbecoa | North Indian, Continental, Chettinad, Andhra, ... | 4.3 | 1000.0 | Marathahalli |
We have obtained a recommendation of the top 10 similar restaurants based on their cosine similarity scores.
In this project, an attempt has been made to understand the restaurant industry in Bengaluru and to predict the rating of a restaurant, achieving an R² score of 0.99 with a random forest algorithm, based on location, availability of online ordering, availability of table booking, cuisines offered, type of service offered, number of votes received, and the sentiment of the reviews. A recommender based on NLP was also built to suggest similar restaurants to customers. This will empower restaurants with the data needed to make the right decisions for a successful business. Future research could make use of other significant factors, such as foot-traffic competition (the number of similar businesses that could impact the new establishment), accessibility, and the average business rates incurred for a particular type of restaurant. These factors could make the analysis more accurate.